From finite-system entropy to entropy rate for a Hidden Markov Process

Or Zuk, Eytan Domany, Ido Kanter and Michael Aizenman 



Abstract — A recent result presented the expansion for the 
entropy rate of a Hidden Markov Process (HMP) as a power 
series in the noise variable $\epsilon$. The coefficients of the expansion
around the noiseless ($\epsilon = 0$) limit were calculated up to 11th
order, using a conjecture that relates the entropy rate of a HMP 
to the entropy of a process of finite length (which is calculated 
analytically). In this communication we generalize and prove 
the validity of the conjecture, and discuss the theoretical and 
practical consequences of our new theorem. 

Index Terms — Hidden Markov Processes, Entropy rate 



I. Introduction 

Let $\{X_N\}$ be a finite-state stationary Markov process over the alphabet $\Sigma = \{1, .., s\}$, and let $\{Y_N\}$ be its noisy observation (on the same alphabet). The pair can be described by the Markov transition matrix $M = M_{s \times s} = \{m_{ij}\}$ and the emission matrix $R = R_{s \times s} = \{r_{ij}\}$, which yield the probabilities $P(X_{N+1} = j | X_N = i) = m_{ij}$ and $P(Y_N = j | X_N = i) = r_{ij}$. We consider here the case where the noise is small, i.e. the signal-to-noise ratio (SNR) is high, and $M$ is strictly positive ($m_{ij} > 0$) and thus has a unique stationary distribution. For the 'high-SNR' regime one may write $R = I + \epsilon T$, where $\epsilon > 0$ is some small number, $I$ is the identity matrix, and the matrix $T = \{t_{ij}\}$ satisfies $t_{ii} \le 0$, $t_{ij} \ge 0, \forall i \ne j$ and $\sum_{j=1}^{s} t_{ij} = 0$.
The process $Y$ can be viewed as an observation of $X$ through a noisy channel. It is an example of a Hidden Markov Process (HMP), and is determined by the parameters $M$, $T$ and $\epsilon$. More generally, HMPs have a rich and developed theory, and numerous applications in various fields (see [1], [2]).
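As a minimal sketch of this setup, the following Python snippet builds $R = I + \epsilon T$ and checks that it is a valid emission matrix. The two-state matrices `M` and `T`, the value of `eps`, and all function names are illustrative assumptions, not taken from the references.

```python
def emission_matrix(T, eps):
    """Return R = I + eps * T as a list of rows."""
    s = len(T)
    return [[(1.0 if i == j else 0.0) + eps * T[i][j] for j in range(s)]
            for i in range(s)]

def is_stochastic(R, tol=1e-12):
    """True if every entry is non-negative and every row sums to one."""
    return all(min(row) >= -tol and abs(sum(row) - 1.0) < tol for row in R)

def stationary(M, iters=1000):
    """Stationary distribution pi = pi M by power iteration.
    Converges because M is assumed strictly positive."""
    s = len(M)
    pi = [1.0 / s] * s
    for _ in range(iters):
        pi = [sum(pi[i] * M[i][j] for i in range(s)) for j in range(s)]
    return pi

# Illustrative (assumed) parameters: a two-state chain with symmetric flip noise.
M = [[0.7, 0.3], [0.2, 0.8]]      # strictly positive transition matrix
T = [[-1.0, 1.0], [1.0, -1.0]]    # rows sum to zero, off-diagonal entries >= 0
R = emission_matrix(T, 0.05)
pi = stationary(M)
```

The row-sum condition on $T$ is what keeps $R$ stochastic for every small $\epsilon \ge 0$.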
An important property of $Y$ is its entropy rate. The Shannon entropy rate of a stochastic process ([3]) measures the amount of 'uncertainty per symbol'. More formally, for $i < j$ let $[X]_i^j$ denote the vector $(X_i, .., X_j)$. The entropy rate is defined as:

$$H(Y) = \lim_{N \to \infty} \frac{H([Y]_1^N)}{N} \tag{1}$$

where $H(X) = -\sum_X P(X) \log P(X)$; we will sometimes omit the realization $x$ of the variable $X$, so $P(X)$ should be understood as $P(X = x)$. For a stationary process the limit in (1) exists and $H$ can also be computed via the conditional entropy ([4]) as $H(Y) = \lim_{N \to \infty} H(Y_N | [Y]_1^{N-1})$. Here $H(U|V)$ represents the conditional entropy, which for random variables $U$ and $V$ is the average uncertainty of the conditional distribution of $U$ conditioned on $V$, that is $H(U|V) = \sum_v P(V = v) H(U | V = v)$. By the chain rule for entropy, it can also be viewed as a difference of entropies, $H(U|V) = H(U, V) - H(V)$. This relation will be used below.
There is at present no explicit expression for the entropy rate of a HMP ([1], [5]). A few recent works ([5], [6], [7]) have dealt with finding the asymptotic behavior of $H$ in several regimes, albeit rigorously giving only bounds or at most second-order ([7]) behavior. Here we generalize and prove a relationship that was posed in [7] as a conjecture, thereby turning the computation presented there, of $H$ as a series expansion up to 11th order in $\epsilon$, into a rigorous statement.
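The finite-length quantities used below can be computed by brute force for short chains. The following Python sketch (an illustration under assumed parameters, not the computation of [7]) evaluates $H([Y]_1^N)$ by enumerating all observation sequences, using the standard forward recursion for each sequence probability, and then forms the conditional entropies $H(Y_N | [Y]_1^{N-1})$ through the chain rule. The binary symmetric parameters `p` and `eps` and all function names are assumptions made for the example.

```python
import itertools
import math

def block_prob(y, M, R, pi):
    """P([Y]_1^N = y), computed with the forward recursion."""
    s = len(M)
    alpha = [pi[i] * R[i][y[0]] for i in range(s)]
    for obs in y[1:]:
        alpha = [sum(alpha[i] * M[i][j] for i in range(s)) * R[j][obs]
                 for j in range(s)]
    return sum(alpha)

def block_entropy(N, M, R, pi):
    """H([Y]_1^N) in nats, by enumerating all s^N observation sequences."""
    if N == 0:
        return 0.0
    total = 0.0
    for y in itertools.product(range(len(M)), repeat=N):
        prob = block_prob(y, M, R, pi)
        if prob > 0.0:
            total -= prob * math.log(prob)
    return total

# Assumed example: binary symmetric chain (flip probability p) with noise eps.
p, eps = 0.3, 0.05
M = [[1 - p, p], [p, 1 - p]]
R = [[1 - eps, eps], [eps, 1 - eps]]
pi = [0.5, 0.5]                      # stationary distribution of M

H = [block_entropy(N, M, R, pi) for N in range(0, 7)]
C = [H[N] - H[N - 1] for N in range(1, 7)]     # C[N-1] = H(Y_N | [Y]_1^{N-1})
rate_bounds = [H[N] / N for N in range(1, 7)]  # H([Y]_1^N) / N
```

Both `C` and `rate_bounds` decrease towards the entropy rate as $N$ grows, with the conditional entropies giving the tighter estimate at each $N$. The enumeration costs $s^N$ sequence evaluations, so the sketch is only practical for small $N$.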

II. Theorem Statement and Proof 
Our main result is the following: 

Theorem 1: Let $H_N \equiv H_N(M, T, \epsilon) = H([Y]_1^N)$ be the entropy of a system of length $N$, and let $C_N = H_N - H_{N-1}$. Assume$^1$ there is some (complex) neighborhood $B_\rho(0) \subset \mathbb{C}$ of zero, in which the (one-variable) functions $\{C_N\}$ and $H$ are analytic in $\epsilon$, with a Taylor expansion given by:

$$C_N(M, T, \epsilon) = \sum_{k=0}^{\infty} C_N^{(k)} \epsilon^k, \qquad H(M, T, \epsilon) = \sum_{k=0}^{\infty} C^{(k)} \epsilon^k \tag{2}$$

(The coefficients $C_N^{(k)}$ and $C^{(k)}$ are functions of the parameters $M$ and $T$. From now on we omit this dependence.) Then:

$$N \ge \left\lceil \frac{k+3}{2} \right\rceil \Rightarrow C_N^{(k)} = C^{(k)} \tag{3}$$

$C_N$ is an upper bound ([4]) for $H$. The behavior stated in Thm. 1 was discovered using symbolic computations, but proven only for $k \le 2$, in the binary symmetric case ([7]). Although technically involved, our proof is based on two simple ideas.

$^1$ It is easy to show that the functions $C_N$ are differentiable to all orders in $\epsilon$, at $\epsilon = 0$. The assumption which is not proven here is that they are in fact analytic with a radius of analyticity which is uniform in $N$, and are uniformly bounded within some common neighborhood of $\epsilon = 0$.
First, we distinguish between the noise parameters at different sites. We consider a more general process $\{Z_N\}$, where $Z_i$'s emission matrix is $R_i = I + \epsilon_i T$. The process $\{Z_N\}$ is determined by $M$, $T$ and $[\epsilon]_1^N$. We define the following functions:

$$F_N(M, T, [\epsilon]_1^N) = H([Z]_1^N) - H([Z]_1^{N-1}) \tag{4}$$

Setting all the $\epsilon_i$'s equal reduces us back to the $Y$ process, so in particular $F_N(M, T, (\epsilon, .., \epsilon)) = C_N(\epsilon)$.
Second, we observe that if a particular $\epsilon_i$ is set to zero, the observation $Z_i$ equals the state $X_i$. Thus, conditioning back to the past is 'blocked'. This can be used to prove:
Lemma 1: Assume $\epsilon_j = 0$ for some $1 < j < N$. Then:

$$F_N([\epsilon]_1^N) = F_{N-j+1}([\epsilon]_j^N)$$

Proof: $F_N$ can be written as the sum:

$$F_N = -\sum_{[Z]_1^N} P([Z]_1^{N-1}) P(Z_N | [Z]_1^{N-1}) \log P(Z_N | [Z]_1^{N-1}) \tag{5}$$

Here the dependence on $[\epsilon]_1^N$ and $M, T$ is hidden in the probabilities $P(..)$. Since $\epsilon_j = 0$, we must have $X_j = Z_j$, and therefore (since $X$ is a Markov chain), conditioning further to the past is 'blocked', that is:

$$\epsilon_j = 0 \Rightarrow P(Z_N | [Z]_1^{N-1}) = P(Z_N | [Z]_j^{N-1}) \tag{6}$$

Substituting in eq. (5) gives:

$$F_N = -\sum_{[Z]_1^N} P([Z]_1^{N-1}) P(Z_N | [Z]_j^{N-1}) \log P(Z_N | [Z]_j^{N-1}) = -\sum_{[Z]_j^N} P([Z]_j^N) \log P(Z_N | [Z]_j^{N-1}) = F_{N-j+1} \tag{7}$$
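The 'blocking' of Lemma 1 can be checked numerically on a small example. The sketch below is an illustration only: the two-state `M`, the flip matrix `T`, the noise vector `eps` and the function names are all assumptions. It computes $F_N$ for site-dependent noise by enumeration and compares it with the shorter system obtained by dropping the sites before the noiseless one.

```python
import itertools
import math

def seq_prob(z, eps, M, T, pi):
    """P([Z]_1^N = z) by the forward recursion; site i emits through I + eps[i]*T."""
    s = len(M)
    def emit(i, x, obs):
        return (1.0 if x == obs else 0.0) + eps[i] * T[x][obs]
    alpha = [pi[x] * emit(0, x, z[0]) for x in range(s)]
    for i in range(1, len(z)):
        alpha = [sum(alpha[x] * M[x][y] for x in range(s)) * emit(i, y, z[i])
                 for y in range(s)]
    return sum(alpha)

def entropy(eps, M, T, pi):
    """H([Z]_1^N) in nats for the noise vector eps (N = len(eps))."""
    if not eps:
        return 0.0
    total = 0.0
    for z in itertools.product(range(len(M)), repeat=len(eps)):
        prob = seq_prob(z, eps, M, T, pi)
        if prob > 0.0:
            total -= prob * math.log(prob)
    return total

def F(eps, M, T, pi):
    """F_N = H([Z]_1^N) - H([Z]_1^{N-1}), as in eq. (4)."""
    eps = list(eps)
    return entropy(eps, M, T, pi) - entropy(eps[:-1], M, T, pi)

# Assumed example: N = 4 with the second site noiseless (j = 2).
M = [[0.7, 0.3], [0.2, 0.8]]
pi = [0.4, 0.6]                       # stationary distribution of M
T = [[-1.0, 1.0], [1.0, -1.0]]
eps = [0.05, 0.0, 0.1, 0.02]

full = F(eps, M, T, pi)               # F_4([eps]_1^4)
reduced = F(eps[1:], M, T, pi)        # F_3([eps]_2^4), i.e. F_{N-j+1}([eps]_j^N)
```

The two values agree up to floating-point rounding. Starting the shorter chain from the stationary distribution is what makes the comparison legitimate, since $X$ is stationary.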



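Let $\vec{k} = [k]_1^N$ be a vector with $k_i \in \{\mathbb{N} \cup 0\}$. Define its 'weight' as $\omega(\vec{k}) = \sum_{i=1}^{N} k_i$. Define also:

$$F_N^{\vec{k}} \equiv \left. \frac{\partial^{\omega(\vec{k})} F_N}{\partial \epsilon_1^{k_1} .. \partial \epsilon_N^{k_N}} \right|_{\vec{\epsilon} = 0} \tag{8}$$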



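As we now show, adding zeros to the left of $\vec{k}$ leaves $F_N^{\vec{k}}$ unchanged:

Lemma 2: Let $\vec{k} = [k]_1^N$ with $k_1 \le 1$. Denote by $\vec{k}^{(r)}$ the concatenation $\vec{k}^{(r)} = (0, .., 0, k_1, .., k_N)$, with $r$ leading zeros. Then:

$$F_N^{\vec{k}} = F_{N+r}^{\vec{k}^{(r)}}, \quad \forall r \in \mathbb{N}$$

Proof: Assume first $k_1 = 0$. Using Lemma 1 we get:

$$F_{N+r}^{\vec{k}^{(r)}} = \left. \frac{\partial^{\omega(\vec{k})} F_{r+N}([\epsilon]_1^{N+r})}{\partial \epsilon_{r+2}^{k_2} .. \partial \epsilon_{r+N}^{k_N}} \right|_{\vec{\epsilon} = 0} = \left. \frac{\partial^{\omega(\vec{k})} F_N([\epsilon]_{r+1}^{r+N})}{\partial \epsilon_{r+2}^{k_2} .. \partial \epsilon_{r+N}^{k_N}} \right|_{\vec{\epsilon} = 0} = F_N^{\vec{k}} \tag{9}$$

The case $k_1 = 1$ is reduced back to the case $k_1 = 0$ by taking the derivative. We denote by $[Z]_1^{N(j \to r)}$ the vector which is equal to $[Z]_1^N$ in all coordinates except on coordinate $j$, where $Z_j = r$. Using eq. (17) we get:

$$F_{N+1}^{\vec{k}^{(1)}} = \left. \frac{\partial^{\omega(\vec{k})} F_{N+1}([\epsilon]_1^{N+1})}{\partial \epsilon_2^{k_1} \partial \epsilon_3^{k_2} .. \partial \epsilon_{N+1}^{k_N}} \right|_{\vec{\epsilon} = 0} = \left. \frac{\partial^{\omega(\vec{k}) - 1}}{\partial \epsilon_3^{k_2} .. \partial \epsilon_{N+1}^{k_N}} \left[ \left. \frac{\partial F_{N+1}}{\partial \epsilon_2} \right|_{\epsilon_2 = 0} \right] \right|_{\vec{\epsilon} = 0}$$

$$= - \left. \frac{\partial^{\omega(\vec{k}) - 1}}{\partial \epsilon_3^{k_2} .. \partial \epsilon_{N+1}^{k_N}} \left\{ \sum_{r=1}^{s} \sum_{[Z]_1^{N+1}} t_{r Z_2} \left[ P([Z]_1^{N+1 (2 \to r)}) \log P(Z_{N+1} | [Z]_1^{N}) - P(Z_{N+1} | [Z]_1^{N}) P([Z]_1^{N (2 \to r)}) \right] \Big|_{\epsilon_2 = 0} \right\} \right|_{\vec{\epsilon} = 0}$$

$$= - \left. \frac{\partial^{\omega(\vec{k}) - 1}}{\partial \epsilon_2^{k_2} .. \partial \epsilon_N^{k_N}} \left\{ \sum_{r=1}^{s} \sum_{[Z]_1^{N}} t_{r Z_1} \left[ P([Z]_1^{N (1 \to r)}) \log P(Z_N | [Z]_1^{N-1}) - P(Z_N | [Z]_1^{N-1}) P([Z]_1^{N-1 (1 \to r)}) \right] \Big|_{\epsilon_1 = 0} \right\} \right|_{\vec{\epsilon} = 0} = F_N^{\vec{k}} \tag{10}$$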
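Lemma 2 can be illustrated numerically for $\vec{k} = (1, 0)$: the derivatives $F_2^{(1,0)}$, $F_3^{(0,1,0)}$ and $F_4^{(0,0,1,0)}$ should coincide. The sketch below estimates them with central finite differences. The parameters, the step size `h` and the function names are assumptions made for the example, and negative values of $\epsilon$ enter only as formal arguments of the analytic function $F_N$.

```python
import itertools
import math

def seq_prob(z, eps, M, T, pi):
    """P([Z]_1^N = z) by the forward recursion; site i emits through I + eps[i]*T."""
    s = len(M)
    def emit(i, x, obs):
        return (1.0 if x == obs else 0.0) + eps[i] * T[x][obs]
    alpha = [pi[x] * emit(0, x, z[0]) for x in range(s)]
    for i in range(1, len(z)):
        alpha = [sum(alpha[x] * M[x][y] for x in range(s)) * emit(i, y, z[i])
                 for y in range(s)]
    return sum(alpha)

def entropy(eps, M, T, pi):
    """H([Z]_1^N) in nats for the noise vector eps."""
    if not eps:
        return 0.0
    total = 0.0
    for z in itertools.product(range(len(M)), repeat=len(eps)):
        prob = seq_prob(z, eps, M, T, pi)
        if prob > 0.0:
            total -= prob * math.log(prob)
    return total

def F(eps, M, T, pi):
    """F_N = H([Z]_1^N) - H([Z]_1^{N-1})."""
    eps = list(eps)
    return entropy(eps, M, T, pi) - entropy(eps[:-1], M, T, pi)

# Assumed example parameters.
M = [[0.7, 0.3], [0.2, 0.8]]
pi = [0.4, 0.6]                       # stationary distribution of M
T = [[-1.0, 1.0], [1.0, -1.0]]
h = 1e-4                              # finite-difference step (assumption)

def padded_derivative(r):
    """Central-difference estimate of F_{2+r}^{k^(r)} for k = (1, 0)."""
    def f(e):
        return F([0.0] * r + [e, 0.0], M, T, pi)
    return (f(h) - f(-h)) / (2 * h)

derivs = [padded_derivative(r) for r in range(3)]   # r = 0, 1, 2 leading zeros
```

The three estimates agree up to the $O(h^2)$ truncation error of the difference scheme, even though the underlying functions of $\epsilon$ are not identical.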



$C_N^{(k)}$ is obtained by summing $F_N^{\vec{k}}$ on all $\vec{k}$'s with weight $k$:

$$C_N^{(k)} = \sum_{\vec{k}, \omega(\vec{k}) = k} F_N^{\vec{k}} \tag{11}$$

The next lemma shows that one does not need to sum on all such $\vec{k}$'s, as many of them give zero contribution:
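Lemma 3: Let $\vec{k} = [k]_1^N$. If $\exists\, i < j < N$, with $k_i \ge 1$, $k_j \le 1$, then $F_N^{\vec{k}} = 0$.

Proof: Assume first $k_j = 0$. Using Lemma 1 we get

$$F_N^{\vec{k}} = \left. \frac{\partial^{\omega(\vec{k})} F_N(\vec{\epsilon})}{\partial \epsilon_1^{k_1} .. \partial \epsilon_N^{k_N}} \right|_{\vec{\epsilon} = 0} = \left. \frac{\partial^{\omega(\vec{k})} F_{N-j+1}([\epsilon]_j^N)}{\partial \epsilon_1^{k_1} .. \partial \epsilon_N^{k_N}} \right|_{\vec{\epsilon} = 0} = 0 \tag{12}$$

where the last equality holds because $\partial F_{N-j+1}([\epsilon]_j^N) / \partial \epsilon_i = 0$ for $i < j$.

Assume now $k_j = 1$. Write the probability of $Z$:

$$P([Z]_1^N) = \sum_{[X]_1^N} P([X]_1^N) P([Z]_1^N | [X]_1^N) = \sum_{[X]_1^N} P([X]_1^N) \prod_{i=1}^{N} (\delta_{X_i Z_i} + \epsilon_i t_{X_i Z_i}) \tag{13}$$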

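where $\delta$ is Kronecker's delta. Differentiate with respect to $\epsilon_j$:

$$\left. \frac{\partial P([Z]_1^N)}{\partial \epsilon_j} \right|_{\epsilon_j = 0} = \sum_{[X]_1^N} \left[ P([X]_1^N) \, t_{X_j Z_j} \prod_{i \ne j} (\delta_{X_i Z_i} + \epsilon_i t_{X_i Z_i}) \right] = \sum_{r=1}^{s} t_{r Z_j} \left. P([Z]_1^{N (j \to r)}) \right|_{\epsilon_j = 0} \tag{14}$$

Using Bayes' rule $P(Z_N | [Z]_1^{N-1}) = P([Z]_1^N) / P([Z]_1^{N-1})$ we get:

$$\left. \frac{\partial P(Z_N | [Z]_1^{N-1})}{\partial \epsilon_j} \right|_{\epsilon_j = 0} = \frac{1}{P([Z]_1^{N-1})} \sum_{r=1}^{s} t_{r Z_j} \left[ P([Z]_1^{N (j \to r)}) - P(Z_N | [Z]_1^{N-1}) P([Z]_1^{N-1 (j \to r)}) \right] \Big|_{\epsilon_j = 0} \tag{15}$$

This gives (with the natural logarithm):

$$\left. \frac{\partial \left[ P([Z]_1^N) \log P(Z_N | [Z]_1^{N-1}) \right]}{\partial \epsilon_j} \right|_{\epsilon_j = 0} = \sum_{r=1}^{s} t_{r Z_j} \left\{ P([Z]_1^{N (j \to r)}) \left[ 1 + \log P(Z_N | [Z]_1^{N-1}) \right] - P(Z_N | [Z]_1^{N-1}) P([Z]_1^{N-1 (j \to r)}) \right\} \Big|_{\epsilon_j = 0} \tag{16}$$

The term $\sum_{r} t_{r Z_j} P([Z]_1^{N (j \to r)})$ vanishes upon summation over $Z_j$, since $P([Z]_1^{N (j \to r)})$ does not depend on $Z_j$ and the rows of $T$ sum to zero. And therefore:

$$\left. \frac{\partial F_N}{\partial \epsilon_j} \right|_{\epsilon_j = 0} = - \sum_{[Z]_1^N} \sum_{r=1}^{s} t_{r Z_j} \left[ P([Z]_1^{N (j \to r)}) \log P(Z_N | [Z]_1^{N-1}) - P(Z_N | [Z]_1^{N-1}) P([Z]_1^{N-1 (j \to r)}) \right] \Big|_{\epsilon_j = 0}$$

$$= - \sum_{[Z]_j^N} \sum_{r=1}^{s} t_{r Z_j} \left[ P([Z]_j^{N (j \to r)}) \log P(Z_N | [Z]_j^{N-1}) - P(Z_N | [Z]_j^{N-1}) P([Z]_j^{N-1 (j \to r)}) \right] \Big|_{\epsilon_j = 0} \tag{17}$$

The latter equality comes from using eq. (6), which 'blocks' the dependence backwards. Eq. (17) shows that $\epsilon_i$ does not appear in $\left. \frac{\partial F_N}{\partial \epsilon_j} \right|_{\epsilon_j = 0}$ for $i < j$, therefore $\left. \frac{\partial^{k_i + 1} F_N}{\partial \epsilon_i^{k_i} \partial \epsilon_j} \right|_{\epsilon_j = 0} = 0$ and $F_N^{\vec{k}} = 0$.

We are now ready to prove our main theorem:
Proof: Let $\vec{k} = [k]_1^N$ with $\omega(\vec{k}) = k$. Define its 'length' as $l(\vec{k}) = N + 1 - \min_{k_i > 1} \{i\}$. It easily follows from Lemma 3 that $F_N^{\vec{k}} \ne 0 \Rightarrow l(\vec{k}) \le \lceil \frac{k+3}{2} \rceil - 1$. Thus, according to Lemma 2:

$$F_N^{\vec{k}} = F_{\lceil \frac{k+3}{2} \rceil}^{(k_{N - \lceil \frac{k+3}{2} \rceil + 1}, .., k_N)} \tag{18}$$

for all $\vec{k}$'s in the sum. Summing on all $F_N^{\vec{k}}$ with the same 'weight' gives $C_N^{(k)} = C_{\lceil \frac{k+3}{2} \rceil}^{(k)}, \forall N \ge \lceil \frac{k+3}{2} \rceil$. But from the analyticity of $C_N$ and $H$ near $\epsilon = 0$ it follows that $\lim_{N \to \infty} C_N^{(k)} = C^{(k)}$, therefore $C_N^{(k)} = C^{(k)}, \forall N \ge \lceil \frac{k+3}{2} \rceil$.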
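Both Lemma 3 and the 'settling' stated in Theorem 1 can be observed numerically for low orders. The following Python sketch is an illustration under assumed parameters (binary symmetric chain with flip probability `p`, step size `h`, hypothetical function names). It estimates the first- and second-order Taylor coefficients of $C_N(\epsilon)$ by central differences for several $N$, and the mixed derivative $F_3^{(1,1,0)}$, which Lemma 3 predicts to vanish. Negative $\epsilon$ is used only as a formal argument.

```python
import itertools
import math

def seq_prob(z, eps, M, T, pi):
    """P([Z]_1^N = z) by the forward recursion; site i emits through I + eps[i]*T."""
    s = len(M)
    def emit(i, x, obs):
        return (1.0 if x == obs else 0.0) + eps[i] * T[x][obs]
    alpha = [pi[x] * emit(0, x, z[0]) for x in range(s)]
    for i in range(1, len(z)):
        alpha = [sum(alpha[x] * M[x][y] for x in range(s)) * emit(i, y, z[i])
                 for y in range(s)]
    return sum(alpha)

def entropy(eps, M, T, pi):
    """H([Z]_1^N) in nats for the noise vector eps."""
    if not eps:
        return 0.0
    total = 0.0
    for z in itertools.product(range(len(M)), repeat=len(eps)):
        prob = seq_prob(z, eps, M, T, pi)
        if prob > 0.0:
            total -= prob * math.log(prob)
    return total

def F(eps, M, T, pi):
    """F_N = H([Z]_1^N) - H([Z]_1^{N-1})."""
    eps = list(eps)
    return entropy(eps, M, T, pi) - entropy(eps[:-1], M, T, pi)

# Assumed example: binary symmetric chain.
p = 0.3
M = [[1 - p, p], [p, 1 - p]]
T = [[-1.0, 1.0], [1.0, -1.0]]
pi = [0.5, 0.5]
h = 2e-4                               # finite-difference step (assumption)

def C(N, e):
    """C_N(eps): all sites share the same noise parameter."""
    return F([e] * N, M, T, pi)

# Estimates of the Taylor coefficients C_N^{(1)} and C_N^{(2)} for N = 1..5.
first = {N: (C(N, h) - C(N, -h)) / (2 * h) for N in range(1, 6)}
second = {N: (C(N, h) - 2 * C(N, 0.0) + C(N, -h)) / (2 * h * h)
          for N in range(1, 6)}

def g(e1, e2):
    return F([e1, e2, 0.0], M, T, pi)

# Mixed derivative F_3^{(1,1,0)}: here i = 1 < j = 2 < N = 3, k_i = 1, k_j = 1.
mixed_110 = (g(h, h) - g(h, -h) - g(-h, h) + g(-h, -h)) / (4 * h * h)

for N in range(1, 6):
    print(N, first[N], second[N])
print("F_3^(1,1,0) estimate:", mixed_110)
```

In this example the first-order estimates coincide from $N = 2$ onwards and the second-order estimates from $N = 3$ onwards, in line with $N \ge \lceil \frac{k+3}{2} \rceil$, while the value at $N = 2$ still differs at second order. Finite differences are only a consistency check; the exact coefficients require symbolic differentiation.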



III. Conclusion 

Our main theorem sheds light on the connection between finite and infinite chains, and gives a practical and straightforward way to compute the entropy rate as a series expansion in $\epsilon$ up to an arbitrary power. The surprising 'settling' of the expansion coefficients, $C_N^{(k)} = C^{(k)}$ for $N \ge \lceil \frac{k+3}{2} \rceil$, holds for the entropy. For other functions involving only conditional probabilities (e.g. relative entropy between two HMPs) a weaker result holds: the coefficients 'settle' for $N > k$. One can expand the entropy rate in several parameter regimes. As it turns out, exactly the same 'settling' as was proven in Thm. 1 happens in the 'almost memoryless' regime, where $M$ is close to a matrix which makes the $X_i$'s i.i.d. This and other regimes, as well as the analytic behavior of the HMP ([8]), will be discussed elsewhere.
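To make the practical consequence concrete, the short Python sketch below tabulates the smallest system size $N$ that Theorem 1 requires for each expansion order $k$. The function name is a hypothetical helper introduced for this illustration.

```python
def min_system_size(k):
    """Smallest N satisfying N >= ceil((k + 3) / 2), in integer arithmetic."""
    return (k + 4) // 2

# Required finite-system length for each order up to the 11th.
sizes = {k: min_system_size(k) for k in range(12)}
for k, n in sizes.items():
    print(f"order {k:2d}: N >= {n}")
```

In particular, an 11th-order expansion needs only finite systems of length $N = 7$.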